A Wikipedia-based Corpus for Contextualized Machine Translation
Authors
Abstract
We describe a corpus for, and experiments in, target-contextualized machine translation (MT), in which we incorporate language models built from target-language documents that are comparable in nature to the source documents. The corpus comprises (i) a set of curated English Wikipedia articles describing news events, (ii) their comparable Spanish counterparts, (iii) a number of the Spanish source articles cited within them, and (iv) English reference translations of all the Spanish data. In experiments, we evaluate the effect on translation quality of including language models built over these English documents and interpolated with other, separately derived, more general language model sources. We find that even this simplistic baseline approach yields significant improvements as measured by BLEU score.
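The interpolation of a document-specific language model with a more general one, as described in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes linear interpolation, uses add-one smoothed unigram models in place of the n-gram models an MT system would normally use, and all function names, the toy sentences, and the weight `lam` are hypothetical.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary (toy stand-in for an n-gram LM)."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def interpolate(p_context, p_general, lam):
    """Linear interpolation: lam * p_context(w) + (1 - lam) * p_general(w)."""
    return {w: lam * p_context[w] + (1 - lam) * p_general[w] for w in p_general}

def perplexity(model, tokens):
    """Perplexity of a token sequence under a unigram model."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

# Hypothetical toy data: a general-domain text, a comparable target-language
# document, and a held-out sentence on the same news event.
general = "the government said the market fell on monday".split()
context = "the earthquake struck the coast and the tsunami followed".split()
heldout = "the tsunami struck the coast".split()
vocab = set(general) | set(context) | set(heldout)

p_general = unigram_lm(general, vocab)
p_context = unigram_lm(context, vocab)
mixed = interpolate(p_context, p_general, lam=0.5)  # lam is an assumed, untuned weight

print(perplexity(p_general, heldout), perplexity(mixed, heldout))
```

On this toy data the interpolated model assigns the held-out sentence a lower perplexity than the general model alone. In practice the interpolation weight would be tuned on held-out data rather than fixed at 0.5.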
Similar resources
Wikipedia as an SMT Training Corpus
This article reports on mass experiments supporting the idea that data extracted from strongly comparable corpora may successfully be used to build statistical machine translation systems of reasonable translation quality for in-domain new texts. The experiments were performed for three language pairs: Spanish-English, German-English and Romanian-English, based on large bilingual corpora of simil...
Automatic Building and Using Parallel Resources for SMT from Comparable Corpora
Building parallel resources for corpus-based machine translation, especially Statistical Machine Translation (SMT), from comparable corpora has recently received wide attention in the field of machine translation research. In this paper, we propose an automatic approach for extraction of parallel fragments from comparable corpora. The comparable corpora are collected from Wikipedia documents and t...
CS671A Natural Language Processing Hindi ↔ English Parallel Corpus Generation from Comparable Corpora for Neural Machine Translation
Neural Machine Translation (NMT) is a new approach to the well-studied task of machine translation, which has significant advantages over traditional approaches in terms of reduced model size and better performance. NMT models require a parallel corpus of significant size to be trained, which is lacking for the Hindi ↔ English language pair. However, significant amounts of comparable corpora a...
Learning to Simplify Sentences Using Wikipedia
In this paper we examine the sentence simplification problem as an English-to-English translation problem, utilizing a corpus of 137K aligned sentence pairs extracted by aligning English Wikipedia and Simple English Wikipedia. This data set contains the full range of transformation operations including rewording, reordering, insertion and deletion. We introduce a new translation model for text ...
Topic Models + Word Alignment = A Flexible Framework for Extracting Bilingual Dictionary from Comparable Corpus
We propose a flexible and effective framework for extracting a bilingual dictionary from comparable corpora. Our approach is based on a novel combination of topic modeling and word alignment techniques. Intuitively, our approach works by converting a comparable document-aligned corpus into a parallel topic-aligned corpus, then learning word alignments using co-occurrence statistics. This topica...
Publication date: 2014